Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SE-3520] Fixes Transcripts Incompletely Uploaded to S3 Bucket #266

Merged

Conversation

nizarmah
Copy link
Contributor

@nizarmah nizarmah commented Oct 24, 2020

There's an issue that has been happening when the video transcripts are being uploaded to S3. The video transcripts that get added through Import Course don't end up getting uploaded to the S3 because of the following error:

S3ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>819c27ac3c56eaa102afe74e65c197f4</ExpectedDigest><CalculatedDigest>YY0oWmnYyOhDbh5Q7POe8A==</CalculatedDigest><RequestId>B92FCA54F14E6186</RequestId><HostId>Vbx/LS1kB4m7h9Aqcn/hiRAaT4zg7nx4sr/s8EAbTfZJFBxJqtNXCt5j4IQaOHExItVCwrPKcZc=</HostId></Error>

Accordingly, this seemed like an existing issue with boto, which requires specifying the file encoding.
More context on the issue can be found in the linked discussion, below.

JIRA tickets: OSPR-5084, SE-3520

Discussions: boto/boto#2868

Installation instructions:

  1. Change the user to the edxapp user, by sudo -Hu edxapp bash.
  2. Install the branch by using the one liner shown below.
  3. Restart cms or all services.
cd && . edxapp_env && . venvs/edxapp/bin/activate && pip uninstall edxval -y && pip install -e git+https://github.com/open-craft/edx-val.git@nizar/s3_content_md5_mismatch#egg=edxval

Testing instructions:

Make sure that your instance uses an S3 Bucket for storage.

  1. Install the edx-val by following the Installation Instructions.
  2. Import a course with a video and a video transcript.
  3. Delete the video transcript in the Edxval > Video Transcripts
  4. Import the course again.
  5. View the course and download the transcript.
  6. Verify that the transcript is correct.

For testing, you can use the Course attached below. The video id in the course attached is 1bb0d9fc-65f0-4b5c-a55b-599ca3c8f425 (will be useful for searching for the video transcript to delete it)

test_course.tar.gz

Author notes & concerns:

  1. I noticed that uploading a video with a new transcript, for a video that already exists and a transcript that already exists, does not update the previously existing transcript, cf these lines. Could you please provide any explanation as to why that's done like that? (@mushtaqak)

Reviewers

@openedx-webhooks openedx-webhooks added needs triage open-source-contribution PR author is not from Axim or 2U labels Oct 24, 2020
@openedx-webhooks
Copy link

Thanks for the pull request, @nizarmah! I've created OSPR-5084 to keep track of it in JIRA, where we prioritize reviews. Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@nizarmah nizarmah force-pushed the nizar/s3_content_md5_mismatch branch from efa8cc4 to 17dae12 Compare October 24, 2020 16:57
@natabene
Copy link

@nizarmah Thank you for your contribution. Please let me know once it is ready for our review.

@openedx-webhooks openedx-webhooks added waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. and removed needs triage labels Oct 28, 2020
Copy link
Contributor

@DawoudSheraz DawoudSheraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a relevant unit test. Thanks

@nizarmah
Copy link
Contributor Author

@natabene I was going to say that this is ready for your review. Thanks for moving along with it, I really appreciate it!

@DawoudSheraz I'll definitely add the relevant unit tests 🙂 I just have a small question that's kind of separate from the unit tests.


Would appreciate any help regarding this, @natabene @DawoudSheraz:
So, as I mentioned in the PR description, currently edx-val doesn't support overwriting existing transcripts.

I identified the code, and fixed it so that it does overwrite existing transcripts. Now the questions I had were, should I open a new PR for that? Or would you prefer I include it in this PR? Or is it something that edX intentionally did like that and isn't interested in making it possible to overwrite transcripts?

@DawoudSheraz
Copy link
Contributor

@nizarmah Hi. Thanks for the follow-up. Regarding your question about not supporting overwriting, that code addition was intentional. If a video already exists on an instance, re-uploading is merely creating and uploading the duplicate of an already existing file. There can be instances where the transcripts are edited manually in the export but that ideally should be uploaded from Studio/Video block to keep the data consistent.

@nizarmah
Copy link
Contributor Author

@DawoudSheraz ah that makes it a little bit more difficult. Thanks for the extremely quick reply btw, I really appreciate it!

So I guess this makes me curious, would you consider upstreaming such a change if we compare the transcript data that is being uploaded with the already existing one? Maybe we can add a hidden content hash for each video transcript and use it to verify if the data changed or not?
That way, if the data is identical, we wouldn't be uploading the duplicate, but if it isn't then we'd be able to update the transcript.

But if you would prefer not to upstream such a change, I totally understand. I'm just asking so that I know that I tried my best to upstream such a change 😅 hahaha.

@DawoudSheraz
Copy link
Contributor

@nizarmah It can be done in a follow-up OSPR but an internal evaluation will be needed if the change is even needed based on the use cases. Thanks

@nizarmah nizarmah force-pushed the nizar/s3_content_md5_mismatch branch from 3610c7c to 047288d Compare October 28, 2020 12:51
@nizarmah
Copy link
Contributor Author

Alright @DawoudSheraz I added a unit test to make sure that the ContentFile content is actually utf-8 encoded.

So, to validate that works, I did the following change in edxval/api.py:

        # Create transcript record.
        create_video_transcript(
            video_id=edx_video_id,
            language_code=language_code,
            file_format=file_format,
-            content=ContentFile(file_content.encode('utf-8')),
+            content=ContentFile(file_content),
            provider=provider
        )

This resulted in the following error:

>       content_encoding = chardet.detect(transcript_content.read())['encoding']

edxval/tests/test_api.py:1934: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

byte_str = 'Hello, edX greets you. random utf-8 characters: éâô'

    def detect(byte_str):
        """
        Detect the encoding of the given byte string.
    
        :param byte_str:     The byte sequence to examine.
        :type byte_str:      ``bytes`` or ``bytearray``
        """
        if not isinstance(byte_str, bytearray):
            if not isinstance(byte_str, bytes):
                raise TypeError('Expected object of type bytes or bytearray, got: '
>                               '{0}'.format(type(byte_str)))
E               TypeError: Expected object of type bytes or bytearray, got: <class 'str'>

.tox/py35-django22/lib/python3.5/site-packages/chardet/__init__.py:34: TypeError

Let me know if you'd like me to add anything else 🙂 👍

Copy link
Contributor

@DawoudSheraz DawoudSheraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the version in setup.py before merging

@nizarmah
Copy link
Contributor Author

That's been done @DawoudSheraz 👍

Thanks a lot for your review 🙂

Copy link

@gabor-boros gabor-boros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 🎉

  • I tested this: Checked the code working in our dev server
  • I read through the code
  • I checked for accessibility issues NA
  • Includes documentation NA
  • I made sure any change in configuration variables is reflected in the corresponding client's configuration-secure repository. NA

@nizarmah
Copy link
Contributor Author

nizarmah commented Nov 5, 2020

@natabene @DawoudSheraz is there anything blocking this PR from getting merged? Please let me know 👍

@openedx-webhooks openedx-webhooks added engineering review and removed waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Nov 5, 2020
@natabene
Copy link

natabene commented Nov 5, 2020

@DawoudSheraz If you think this is good to be merged, please merge - community authors don't have permissions to do so.

@DawoudSheraz DawoudSheraz merged commit dd8b424 into openedx:master Nov 5, 2020
@openedx-webhooks
Copy link

@nizarmah 🎉 Your pull request was merged!

Please take a moment to answer a two question survey so we can improve your experience in the future.

@DawoudSheraz
Copy link
Contributor

@nizarmah I have merged and created a release tag https://github.com/edx/edx-val/releases/tag/1.4.3. The changes will soon be part of the platform after a requirements update. Thanks for your contribution.

@nizarmah
Copy link
Contributor Author

nizarmah commented Nov 5, 2020

Thanks a lot! 😄

@nizarmah nizarmah deleted the nizar/s3_content_md5_mismatch branch November 5, 2020 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged open-source-contribution PR author is not from Axim or 2U
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants